AITopics

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.81)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.60)

arXiv.org Artificial Intelligence Mar-9-2025

Policy Regularization on Globally Accessible States in Cross-Dynamics Reinforcement Learning

Xue, Zhenghai, Feng, Lang, Xu, Jiacheng, Kang, Kang, Wen, Xiang, An, Bo, Yan, Shuicheng

To learn from data collected in diverse dynamics, Imitation from Observation (IfO) methods leverage expert state trajectories based on the premise that recovering expert state distributions in other dynamics facilitates policy learning in the current one. However, Imitation Learning inherently imposes an upper bound on the performance of learned policies. Additionally, as the environment dynamics change, certain expert states may become inaccessible, rendering their distributions less valuable for imitation. To address this, we propose a novel framework that integrates reward maximization with IfO, employing F-distance regularized policy optimization. This framework enforces constraints on globally accessible states (those with nonzero visitation frequency across all considered dynamics), mitigating the challenge posed by inaccessible states. By instantiating F-distance in different ways, we derive two theoretical analyses and develop a practical algorithm called Accessible State Oriented Policy Regularization (ASOR). ASOR serves as a general add-on module that can be incorporated into various RL approaches, including offline RL and off-policy RL. Extensive experiments across multiple benchmarks demonstrate ASOR's effectiveness in enhancing state-of-the-art cross-domain policy transfer algorithms, significantly improving their performance.

algorithm, globally accessible state, state distribution

2503.06893

Country:

Asia > Singapore (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.50)

Industry: Education (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Neural Information Processing Systems Jan-27-2025, 19:21:26 GMT

Review for NeurIPS paper: Promoting Coordination through Policy Regularization in Multi-Agent Deep Reinforcement Learning

Summary and Contributions: Based on rebuttal and discussion: Upon reading all reviews, I recognize that we agree the article is well presented, and I stand by the concerns I raised. Note that I primarily criticized the absence of some relevant context in the original submission (which the authors admit in their rebuttal), rather than the contribution itself (although it may be smaller than proclaimed). Their refutation of it being a planning setting is fair. While I maintain that it is a self-play setting, this is implied by CTDE and thus not necessary to state again. A stale flavor remains from overselling their contribution's novelty in the introduction [L36-45].

multi-agent deep reinforcement learning, policy regularization, promoting coordination

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.85)

Neural Information Processing Systems Jan-27-2025, 19:21:18 GMT

Review for NeurIPS paper: Promoting Coordination through Policy Regularization in Multi-Agent Deep Reinforcement Learning

Originally, there was some disagreement between reviewers on this paper, but after rebuttal and careful discussion between reviewers and AC, all agree that the paper is interesting, has merit, and could be proposed for acceptance as a poster. One critical reviewer now recognises that the predictability idea is neat and that the concern about the positioning of the work has been largely clarified. Reviewers agree there is a contribution to joint exploration in MAS, which is one of the bottlenecks that deserves to be addressed and discussed.

multi-agent deep reinforcement learning, policy regularization, promoting coordination

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.85)

Neural Information Processing Systems Oct-11-2024, 04:09:31 GMT

Promoting Coordination through Policy Regularization in Multi-Agent Deep Reinforcement Learning

multi-agent deep reinforcement learning, policy regularization, promoting coordination

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.44)

arXiv.org Artificial Intelligence May-28-2024

Efficient Preference-based Reinforcement Learning via Aligned Experience Estimation

Bai, Fengshuo, Zhao, Rui, Zhang, Hongming, Cui, Sijia, Wen, Ying, Yang, Yaodong, Xu, Bo, Han, Lei

Preference-based reinforcement learning (PbRL) has shown impressive capabilities in training agents without reward engineering. However, a notable limitation of PbRL is its dependency on substantial human feedback. This dependency stems from the learning loop, which entails accurate reward learning compounded with value/policy learning, necessitating a considerable number of samples. To boost the learning loop, we propose SEER, an efficient PbRL method that integrates label smoothing and policy regularization techniques. Label smoothing reduces overfitting of the reward model by smoothing human preference labels. Additionally, we bootstrap a conservative estimate $\widehat{Q}$ using well-supported state-action pairs from the current replay memory to mitigate overestimation bias and utilize it for policy learning regularization. Our experimental results across a variety of complex tasks, both in online and offline settings, demonstrate that our approach improves feedback efficiency, outperforming state-of-the-art methods by a large margin. Ablation studies further reveal that SEER achieves a more accurate Q-function compared to prior work.
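The label-smoothing idea mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a Bradley-Terry preference model over summed predicted rewards, which is a common choice in PbRL but is not stated in the abstract, and a uniform smoothing factor `eps`. The function names and the value of `eps` are hypothetical and are not taken from SEER's implementation.

```python
import math

def smooth_label(y, eps=0.1):
    """Uniform label smoothing for a binary preference label y in {0, 1}.

    eps is an assumed smoothing factor; eps=0 leaves the label unchanged.
    """
    return (1.0 - eps) * y + eps * 0.5

def preference_prob(return_a, return_b):
    """Bradley-Terry probability (assumed model) that segment A is preferred
    over segment B, given the summed predicted reward of each segment."""
    return 1.0 / (1.0 + math.exp(return_b - return_a))

def preference_loss(rewards_a, rewards_b, y, eps=0.1):
    """Cross-entropy between the smoothed label and the predicted preference.

    y = 1 means the annotator preferred segment A, y = 0 means segment B.
    """
    p = preference_prob(sum(rewards_a), sum(rewards_b))
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep log() finite
    t = smooth_label(y, eps)
    return -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
```

With a smoothed target, a reward model that is already very confident in the labeled preference still incurs a nonzero loss, which discourages it from pushing predicted reward gaps arbitrarily far apart on a small set of human labels.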

equation, international conference, learning

2405.18688

Country:

North America > Canada > Alberta (0.14)
Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial Intelligence May-27-2024

Q-value Regularized Transformer for Offline Reinforcement Learning

Hu, Shengchao, Fan, Ziqing, Huang, Chaoqin, Shen, Li, Zhang, Ya, Wang, Yanfeng, Tao, Dacheng

Recent advancements in offline reinforcement learning (RL) have underscored the capabilities of Conditional Sequence Modeling (CSM), a paradigm that learns the action distribution based on the history trajectory and target returns for each state. However, these methods often struggle with stitching together optimal trajectories from sub-optimal ones due to the inconsistency between the sampled returns within individual trajectories and the optimal returns across multiple trajectories. Fortunately, Dynamic Programming (DP) methods offer a solution by leveraging a value function to approximate optimal future returns for each state, although these techniques are prone to unstable learning behaviors, particularly in long-horizon and sparse-reward scenarios. Building upon these insights, we propose the Q-value regularized Transformer (QT), which combines the trajectory modeling ability of the Transformer with the predictability of optimal future returns from DP methods. QT learns an action-value function and integrates a term maximizing action-values into the training loss of CSM, which aims to seek optimal actions that align closely with the behavior policy. Empirical evaluations on D4RL benchmark datasets demonstrate the superiority of QT over traditional DP and CSM methods, highlighting the potential of QT to enhance the state-of-the-art in offline RL.

learning, q-value regularized transformer, trajectory

2405.17098

Country:

Europe > Austria > Vienna (0.14)
Asia > China > Shanghai > Shanghai (0.04)
Asia > Singapore (0.04)

Genre: Research Report > New Finding (0.93)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

arXiv.org Artificial Intelligence Feb-27-2024

COPR: Continual Human Preference Learning via Optimal Policy Regularization

Zhang, Han, Gui, Lin, Lei, Yu, Zhai, Yuanzhao, Zhang, Yehong, He, Yulan, Wang, Hui, Yu, Yue, Wong, Kam-Fai, Liang, Bin, Xu, Ruifeng

Reinforcement Learning from Human Feedback (RLHF) is commonly utilized to improve the alignment of Large Language Models (LLMs) with human preferences. Given the evolving nature of human preferences, continual alignment becomes more crucial and practical in comparison to traditional static alignment. Nevertheless, making RLHF compatible with Continual Learning (CL) is challenging due to its complex process. Meanwhile, directly learning new human preferences may lead to Catastrophic Forgetting (CF) of historical preferences, resulting in unhelpful or harmful outputs. To overcome these challenges, we propose the Continual Optimal Policy Regularization (COPR) method, which draws inspiration from the optimal policy theory. COPR utilizes a sampling distribution as a demonstration and regularization constraints for CL. It adopts the Lagrangian Duality (LD) method to dynamically regularize the current policy based on the historically optimal policy, which prevents CF and avoids over-emphasizing unbalanced objectives. We also provide a formal proof for the learnability of COPR. The experimental results show that COPR outperforms strong CL baselines on our proposed benchmark in terms of reward-based evaluation, GPT-4 evaluation, and human assessment. Furthermore, we validate the robustness of COPR under various CL settings, including different backbones, replay memory sizes, and learning orders.

continual human preference learning, large language model, machine learning

2402.14228

Country:

Asia > China > Guangdong Province > Shenzhen (0.04)
North America > United States > Oregon > Multnomah County > Portland (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)

Genre: Research Report > New Finding (0.48)

Industry:

Leisure & Entertainment (1.00)
Education (1.00)
Media > Film (0.94)
Health & Medicine (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial Intelligence Aug-25-2023

Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning

Wang, Zhendong, Hunt, Jonathan J, Zhou, Mingyuan

Offline reinforcement learning (RL), which aims to learn an optimal policy using a previously collected static dataset, is an important paradigm of RL. Standard RL methods often perform poorly in this regime due to function approximation errors on out-of-distribution actions. While a variety of regularization methods have been proposed to mitigate this issue, they are often constrained by policy classes with limited expressiveness, which can lead to highly suboptimal solutions. In this paper, we propose representing the policy as a diffusion model, a recent class of highly expressive deep generative models. We introduce Diffusion Q-learning (Diffusion-QL), which utilizes a conditional diffusion model to represent the policy. In our approach, we learn an action-value function and add a term maximizing action-values to the training loss of the conditional diffusion model, which results in a loss that seeks optimal actions that are near the behavior policy. We show that the expressiveness of the diffusion model-based policy and the coupling of behavior cloning and policy improvement under the diffusion model both contribute to the outstanding performance of Diffusion-QL. We illustrate the superiority of our method compared to prior works in a simple 2D bandit example with a multimodal behavior policy. We then show that our method can achieve state-of-the-art performance on the majority of the D4RL benchmark tasks.
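The loss structure described in the abstract above, a behavior-cloning term plus a term that maximizes action-values, can be sketched roughly as follows. This is a simplified illustration only: it assumes the per-sample behavior-cloning losses (for a diffusion policy, these would be denoising losses) and the critic values have already been computed elsewhere. The name `regularized_policy_loss` and the weight `eta` are hypothetical.

```python
def regularized_policy_loss(bc_losses, q_values, eta=1.0):
    """Combine a behavior-cloning loss with a Q-maximizing term.

    bc_losses: per-sample behavior-cloning losses (placeholder inputs here).
    q_values:  critic estimates Q(s, a) for actions sampled from the policy.
    eta:       assumed trade-off weight between imitation and improvement.

    Minimizing the result keeps actions near the behavior policy (low BC loss)
    while pushing them toward higher estimated value (high mean Q).
    """
    bc_term = sum(bc_losses) / len(bc_losses)
    q_term = sum(q_values) / len(q_values)
    return bc_term - eta * q_term
```

Setting `eta` to zero reduces the objective to pure behavior cloning, while larger values weight policy improvement more heavily at the cost of drifting further from the dataset actions.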

diffusion model, machine learning, reinforcement learning

2208.06193

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Texas > Travis County > Austin (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.34)

arXiv.org Artificial Intelligence Aug-15-2023

Policy Regularization with Dataset Constraint for Offline Reinforcement Learning

Ran, Yuhang, Li, Yi-Chen, Zhang, Fuxiang, Zhang, Zongzhang, Yu, Yang

We consider the problem of learning the best possible policy from a fixed dataset, known as offline Reinforcement Learning (RL). A common category of existing offline RL works is policy regularization, which typically constrains the learned policy by the distribution or support of the behavior policy. However, distribution and support constraints are overly conservative since they both force the policy to choose actions similar to those of the behavior policy in a given state. This limits the learned policy's performance, especially when the behavior policy is sub-optimal. In this paper, we find that regularizing the policy towards the nearest state-action pair can be more effective and thus propose Policy Regularization with Dataset Constraint (PRDC). When updating the policy in a given state, PRDC searches the entire dataset for the nearest state-action sample and then restricts the policy with the action of this sample. Unlike previous works, PRDC can guide the policy with proper behaviors from the dataset, allowing it to choose actions that do not appear in the dataset along with the given state. It is a softer constraint but still keeps enough conservatism from out-of-distribution actions. Empirical evidence and theoretical analysis show that PRDC can alleviate offline RL's fundamentally challenging value overestimation issue with a bounded performance gap. Moreover, on a set of locomotion and navigation tasks, PRDC achieves state-of-the-art performance compared with existing methods. Code is available at https://github.com/LAMDA-RL/PRDC
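The nearest-sample search described in the abstract above can be illustrated with a small brute-force sketch. It is not taken from the linked repository. The distance used here (Euclidean over the concatenated state and action, with the state scaled by a weight `beta`) and the squared-error penalty are assumptions made for illustration, and the function names are hypothetical. A practical implementation would likely use a spatial index rather than a linear scan.

```python
import math

def nearest_sample(state, action, dataset, beta=2.0):
    """Return the (state, action) pair in `dataset` closest to the query.

    dataset: list of (state, action) tuples of floats.
    beta:    assumed weight on the state part of the distance; larger values
             favor samples from states close to the query state.
    """
    def dist(pair):
        s, a = pair
        ds = sum((beta * (x - y)) ** 2 for x, y in zip(state, s))
        da = sum((x - y) ** 2 for x, y in zip(action, a))
        return math.sqrt(ds + da)
    return min(dataset, key=dist)

def dataset_constraint_penalty(state, policy_action, dataset, beta=2.0):
    """Squared distance between the policy's action and the action of the
    nearest dataset sample (an assumed form of the regularizer)."""
    _, nearest_action = nearest_sample(state, policy_action, dataset, beta)
    return sum((x - y) ** 2 for x, y in zip(policy_action, nearest_action))

# Toy dataset: 2-D states, 1-D actions.
toy_dataset = [
    ((0.0, 0.0), (1.0,)),
    ((1.0, 1.0), (-1.0,)),
    ((0.1, 0.0), (0.5,)),
]
```

In the toy dataset, a query at state `(0.0, 0.0)` with policy action `(0.6,)` is matched to the sample recorded at the neighboring state `(0.1, 0.0)` when `beta=2.0`, so the policy is guided by an action that never appeared together with the query state. Raising `beta` makes the search prefer the sample from the exact state instead.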

artificial intelligence, machine learning, reinforcement learning

2306.06569

Country:

North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
Asia > China > Jiangsu Province > Nanjing (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)